Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for 'size' in EQL Sample queries #87846

Conversation

luigidellaquila
Copy link
Contributor

EQL queries allow to pass a "size" parameter to limit the number of returned results:

{
  "size": 3,
  "query": "<an EQL query>"
}

This PR adds support for this parameter on Sample queries.

@luigidellaquila luigidellaquila marked this pull request as ready for review June 21, 2022 09:19
@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Jun 21, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments

@@ -15,21 +15,28 @@

public class Sample extends AbstractJoin {

public Sample(Source source, List<KeyedFilter> queries) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to override hashCode and equals to include limit as well.

@@ -24,20 +25,22 @@
public class SampleExec extends PhysicalPlan {

private final List<List<Attribute>> keys;
private final Limit limit;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not an Integer here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I did at first, but then I realized that a Limit could have a few advantages:

  • it's consistent with the other EQL command implementations (eg. SequenceMatcher uses a Limit as well)
  • it anticipates the need for pagination, that is pretty natural in most scenarios (see also here)
  • it doesn't really add any significant complication or overhead

Anyway, it's not strictly necessary for the scope of this PR, so let's discuss it; if there is no agreement, I can easily reduce it to a number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limit it is.

@@ -24,20 +25,22 @@
public class SampleExec extends PhysicalPlan {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, again, hashCode and equals methods updated.

@@ -174,8 +178,16 @@ private void finalStep(ActionListener<Payload> listener) {
if (docGroupsCounter == maxCriteria) {
List<SearchHit> match = matchSample(sample, maxCriteria);
if (match != null) {
finalSamples.add(match);
samples.add(new Sample(sampleKeys.get(responseIndex / maxCriteria), match));
if (samplesDiscarded < limit.offset()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHy do you use both offset and limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we have Limit (see previous comment), IMHO it makes sense to have a complete support for it.
The logic is trivial and the added value could be significant: with a small additional effort on syntax, we'll have at least a basic support for pagination.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why not having Limit as a node in the logical plan tree (vs a property of Sample plan)? (like SQL and EQL do it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, checking

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pagination won't work without state.
Without reading the comments first, I also found this code confusing and had to check if we provide a non-null offset anywhere. I guess this could be clarified with a comment, but not sure where such a comment should be placed (if we ever add an offset, a comment placed here will nearly certainly not be updated/removed).
IMO, since the logic is trivial, it could be added when actually needed. But I guess it could also be kept.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the functionality is not there, the code should be removed. Let's not add code for functionality that is to be implemented sometime in the future or, worse, not at all. It will only confuse anyone reading the code.

- LimitWithOffset in the logical plan tree
- proper equals()/hashCode()
@@ -102,7 +102,13 @@ public Object visitStatement(StatementContext ctx) {
if (ctx.pipe().size() > 0) {
throw new ParsingException(source(ctx.pipe().get(0)), "Samples do not support pipes yet");
}
return plan;
LimitWithOffset limitPlan = new LimitWithOffset(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit pick

Suggested change
LimitWithOffset limitPlan = new LimitWithOffset(
return new LimitWithOffset(

@@ -102,7 +102,13 @@ public Object visitStatement(StatementContext ctx) {
if (ctx.pipe().size() > 0) {
throw new ParsingException(source(ctx.pipe().get(0)), "Samples do not support pipes yet");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't support for pipes now also be possible since limit and offset are both supported?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because the limit/size comes from a request parameter, not from the query itself (through the query parser).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it works with sequence, doesn't it? why would sample be different in this regard?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With sequence you have | head 5 | tail 3 | head 1. How do you do this with samples?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can do sequence [any where true] [any where true] | head 10 | tail 2 why can't you do something like sample [any where true] [any where true] | head 10 | tail 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see HEAD and TAIL as very natural operators for sequences, that have well defined semantics defined by their order (eg. TAIL 10: give me last ten - more recent - events).
Samples, on the other hand, do not have a natural order: "last ten samples" is not a clear concept by itself.
In this context, LIMIT/OFFSET is much more similar to the SQL equivalent, where you just need a way to retrieve the results in batches or to show them in a UI in pages.
I agree with @Luegg on the fact that we could support pipes here eventually, but we probably need some slightly different syntax, eg | LIMIT 10 OFFSET 20; or reuse HEAD with | OFFSET 20 | HEAD 10, though it seems less natural in this context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Luegg, as @luigidellaquila well put this, re-using HEAD and TAIL doesn't reflect the nature of samples which do not have a chronological ordering concept, like sequences do. The only type of ordering that make sense is the one on the join keys, but at that point the simple concepts of HEAD and TAIL are not enough to express the multitude of ordering options available (order by first join key asc, order by the second join key desc, order by the third join key asc and so on; or order by all keys asc, or all keys desc, but at that point one would ask the question why cannot I order by each key individually etc). At this point in the evolution of samples, I believe having a circuit breaker and a simple way of limiting (without computing all the samples) the number of results returned, is enough. With time, and more input from users, we can evaluate our options in providing something more advanced and more granular in terms of limiting the number of results and, at the same time, sorting the results.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it as reusing head and tail for sample. It's more like making sample work with the other language constructs as sample is the only construct that can not be fed into pipes. In this perspective, head can just fetch the first x results independent of how the order is specified. I agree though that tail might be a bit more work because it requires to reverse the sort order and that's probably better kept out of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Luegg I agree with you, tail is definitely not straight forward at this stage.
Also for a proper implementation of pipes on samples (and limit/offset in particular), I think we still need to define a few small details that do not really fit in the scope of this PR.
For this specific one, I'd suggest to stick to the original goal, but keeping a door open for further improvements.

@luigidellaquila
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/part-2

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were two approaches in this PR regarding size:

  • a Sample having a size attribute, which I don't believe it's something that the sample node must have
  • a Limit logical plan with a Sample logical plan child node

In the latter case (the one in the PR at this point), transforming the logical plan into a physical one doesn't follow the usual steps. For example, the Mapper is the one that generates physical plans based on the logical ones and, according to what's in there already a LimitWithOffset should map to a LimitWithOffsetExec. In your case, SampleExec physical plan includes a Limit, but it should in fact belong to a physical plan having as parent a LimitWithOffsetExec.

@@ -250,3 +250,61 @@ expected_event_ids = [18,11]
join_keys = ["doom","win10"]


[[queries]]
name = "size0"
size = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other eql queries use size as well, but none of them require the toml tests to have size in them. I'm wondering if there isn't any other way to test size given that so far there was no need for size in IT toml tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, the Mapper is the one that generates physical plans based on the logical ones and, according to what's in there already a LimitWithOffset should map to a LimitWithOffsetExec. In your case, SampleExec physical plan includes a Limit, but it should in fact belong to a physical plan having as parent a LimitWithOffsetExec.

I'm not sure I got this comment: LimitWithOffset+Sample logical plan is transformed to LimitWithOffsetExec+SampleExec, then QueryFolder.FoldLimit rule applies the limit (from LimitWithOffsetExec) to SampleExec, that is then used to build SampleIterator.
This is the exact same thing that happens for Sequence execution planning.

About "size" in the tests, we have unit tests (mostly for the execution planning, plus some specific tests for the algorithms) in Java, but they do not fully execute any query (mocking the full execution is extremely convoluted due to the usage of internal - Server - components in the execution, that cannot be instantiated in our tests).
The only real execution tests we have are TOML tests, this is why I added a size attribute there. It allows us to test the size on a full execution path.
In addition, I think also Sequence tests can take advantage of it.

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
idegtiarenko and others added 5 commits August 8, 2022 09:13
This is required in case new shards allocator might be more proactive with
rebalancing.
Clarify that the limits in the docs are absolute maxima that will avoid
things just breaking but won't necessarily give great performance.
If the date histogram interval is large and the 'fixed_interval'
parameter is very small we might end up with a large number of
buckets in the resulting histogram, in case we also generate empty
buckets. As a result of this we might generate too many buckets
(max date - min date) / fixed_interval > 65536 (roughly)..

Here we set minDocCount to 1 so to avoid generation of empty buckets.
In the test the maximum value for 'docCount' is 9000 which means,
in the worst case we generate 9000 documents, each belonging to a
different bucket. In this case we would have 9000 buckets maximum
which is well below the default maximum number of buckets allowed by
default.
mark-vieira and others added 21 commits August 25, 2022 00:14
Follow up to elastic#89517 to ensure we don't mistakenly pass the
`xpack.ml.enabled` flag on versions of ES that have no such setting.
We can use try-with-resources here, and remove a null check because IndexShard.get
never returns null.
For large snapshot clusters the final cleanup step that removes
previous shard metadata might have to delete tens of thousands of blobs.
This can take many seconds, for which the repository is needlessly blocked
from running subsequent finalizations or deletes.
Similar to how the final cleanup step was made async and taken out of the
finalization loop for deletes in elastic#86514 this PR moves the final
cleanup step for finalization async.
**The issue** The flaky test simulates the following: - Create a shrink
policy with an invalid target shard count - Then change the policy to
have a valid target shard count - Expectation: the
`check-target-shards-count` will return true and the shrink operation
will be successful.

What was happening in some cases in the background: - Create the shrink
policy with an invalid target shard count - The
`check-target-shards-count` gets created and queued to be executed with
the invalid target shards count. - The task doesn't get enough priority
to be executed - We change the policy to have a valid target shards
count - We execute the queued task which still has the outdated target
shard count.

**Proof** We enriched the code with some extra logging to verify that
the above scenario is actually correct:

```
## Adding the check target shards to the executingTasks

[2022-08-08T18:02:52,824][INFO ][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] elastic#78460: Adding task to queue check if I can shrink to numberOfShards = 5
[2022-08-08T18:02:52,825][TRACE][o.e.x.i.h.ILMHistoryStore] [javaRestTest-0] queueing ILM history item for indexing [ilm-history-5]: [{"index":"index-zmmrkzfhht","policy":"policy-bEmKF","@timestamp":1659970972825,"index_age":12608,"success":true,"state":{"phase":"warm","phase_definition":"{\"policy\":\"policy-bEmKF\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"shrink\":{\"number_of_shards\":5}}},\"version\":1,\"modified_date_in_millis\":1659970962968}","action_time":"1659970968847","phase_time":"1659970966014","action":"shrink","step":"check-target-shards-count","creation_date":"1659970960217","step_time":"1659970972076"}}]

## We change the policy before even the condition is never even evaluated.

[2022-08-08T18:02:52,825][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-0] updating index lifecycle policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updated policy [policy-bEmKF] contains the same phase step keys and can be refreshed
[2022-08-08T18:02:52,826][TRACE][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updating cached phase definition for policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] refreshed policy [policy-bEmKF] phase definition for [1] indices

## We check the condition for the first time but the target shard count is already outdated

[2022-08-08T18:02:53,406][ERROR][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] elastic#78460: Policy has different target number of shards in cluster state 2 vs what will be executed 5.
[2022-08-08T18:02:53,441][DEBUG][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] lifecycle action of policy [policy-bEmKF] for index [index-zmmrkzfhht] cannot make progress because the target shards count [5] must be a factor of the source index's shards count [4]
```

**Impact** We do not think that the impact is that big for production
clusters because there are many more cluster state updates. However, it
might cause some inconvenience to the users who fixed a policy and do
not see the effect as soon as they could have.

**The fix** Our proposed fix is to not provide the target shard count
upon task creation but to retrieve from the cluster state. This way we
ensure it will have the newest value.

**Future work** Currently for every cluster state we go through all the
indices and we check if any step needs to be executed. This doesn't
scale well. We would like to try to switch to a more efficient model
potentially with a cluster state observer. Issue will be created soon.

Resolves elastic#78460
As per elastic#66908, the setting now defaults to `True`, but it's not shown in the doc.  Can we please have the doc updated?
* [DOCS] Update add node section

* Include information for enrolling nodes
We never create a stored context that has transient headers to clear while at the same
time preserves response headers. Not supporting this path allows for splitting up
the logic into 3 distinct methods that are much easier to follow logically.

Also, this commit removes a few redundant threadlocal.get and set calls.
This commit changes the status code returned when the start
trained model deployment api times out from `500` to `408`.
In addition, we add validation that the timeout must be positive.

Relates elastic#89585
Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a small nitpick.

What happens if both size and limit are specified?

@@ -46,6 +46,7 @@ public abstract class BaseEqlSpecTestCase extends RemoteClusterAwareEqlRestTestC
* For now, every value will be converted to a String.
*/
private final String[] joinKeys;
private final Integer size;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only one nitpick around the usage of Integer (object) instead of a primitive.
The sign doesn't seem to be used (head and tail take care of where the limit is applied) hence why not use the negative value as indicator that the size is out of band and thus unspecified?
As Integer, size has 3 states - valid if >=0, unspecified if null and invalid if <0.
What I'm saying is using only 2 through a primitive (int) - valid if >=0, unspecified if <0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if both size and limit are specified?

Do you mean something like ... | tail n with n different from size?
For samples, at this stage, pipes are not supported yet, so it cannot happen.
For sequences, both conditions are applied, so the smaller one prevails; this seems a reasonable behavior, so I'd say we could have the same for samples in the future

ywangd and others added 3 commits August 26, 2022 14:18
When DLS/FLS is enabled on an index for a user, the user will not be
able to perform operations that can make data content accessible under a
different name. This is intentional. This PR makes it clear in the doc.

doc preview link:
https://elasticsearch_89606.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/security-limitations.html#field-document-limitations
@luigidellaquila luigidellaquila requested a review from a team as a code owner August 26, 2022 09:37
@luigidellaquila
Copy link
Contributor Author

Sorry for the noise here, it seems after a merge from main branch GitHub is struggling to re-calculate the diff with the head branch feature/eql_samples.

The PR shows ~200 commits ahead of the feature branch, but with a manual check it's clear that all of them are already in the feature branch and that both the feature branch and the PR are up to date with main, so this behavior is unexpected, most likely related to isaacs/github#750 and to our switch from master to main branch.

I'll give it a few hours to settle and hopefully re-calculate the diff. If it doesn't fix by itself, I'll close this and open a new PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/EQL EQL querying >enhancement Team:QL (Deprecated) Meta label for query languages team v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.